Abstract

Background: Escherichia coli is an important species of bacteria that can live as a harmless inhabitant of the guts of
many animals, as a pathogen causing life-threatening conditions or freely in the non-host environment. This diversity
of lifestyles has made it a particular focus of interest for studies of genetic variation, mainly with the aim of understanding
how a commensal can become a deadly pathogen. Many whole genomes of E. coli have been fully sequenced in the
past few years, which offer valuable data for understanding how this important species evolved.

Results: We compared 27 whole genomes encompassing four phylogroups of Escherichia coli (A, B1, B2 and E). From
the core-genome we established the clonal relationships between the isolates as well as the role played by 
homologous recombination during their evolution from a common ancestor. We found strong evidence for sexual 
isolation between three lineages (A+B1, B2, E), which could be explained by the ecological structuring of E. coli and
may represent on-going speciation. We identified three hotspots of homologous recombination, one of which had 
not been previously described and contains the aroC gene, involved in the essential shikimate metabolic pathway. 
We also described the role played by non-homologous recombination in the pan-genome, and showed that this 
process was highly heterogeneous. Our analyses revealed in particular that the genomes of three 
enterohaemorrhagic (EHEC) strains within phylogroup B1 have converged from originally separate backgrounds as a
result of both homologous and non-homologous recombination. 

Conclusions: Recombination is an important force shaping the genomic evolution and diversification of E. coli, both
by replacing fragments of genes with a homologous sequence and also by introducing new genes. In this study,
several non-random patterns of these events were identified which correlated with important changes in the lifestyle 
of the bacteria, and therefore provide additional evidence to explain the relationship between genomic variation and 
ecological adaptation. 



Background 

Recombination is a fundamental process of bacterial evo- 
lution, capable of influencing the integrity of species 
[1-3]. Two types of recombination are typically distin- 
guished: homologous recombination, where a fragment of 
a genome is replaced by the corresponding sequence from 
another genome [4], and non-homologous recombination, 
which causes genetic additions of new material and is also 
called lateral gene transfer (LGT) [5]. These two types 
of recombination may in fact often happen simultaneously,
but they are usually studied separately because of
the very different signatures they produce on the genomic
sequences. Both homologous and non-homologous types
of recombination are key elements of the evolution of 
bacteria and can be linked to variations in fitness, and 
thus ecologies and lifestyles. There is indeed an ecolog- 
ical component in bacterial recombination, in the sense 
that bacteria with overlapping living environments, reser- 
voirs or hosts (i.e., "overlapping ecologies") will have more 
opportunities for genetic exchange than species or lin- 
eages living in drastically distinct environments. Recom- 
bination is therefore clearly conditioned by ecology, but 
conversely it is probable that recombination often drives 
ecological changes, for example by allowing favourable
innovations to be exchanged by separate lineages adapting
to the same lifestyle [3,6].

Escherichia coli is a good example of an environmentally 
versatile and adaptable bacterial species. It encompasses 
some strains able to live commensally with their host and 
others causing a relatively wide variety of disease symptoms,
from diarrhoea or renal failure to meningitis [7]. On
top of this commensal versus pathogen duality, which may
not represent a strict categorization, E. coli can be found
in a wide range of hosts, as well as secondary non-host
environments such as water, soils or plants [8,9], in which
it sometimes seems to persist very well [10-13]. At the
phylogenetic level, this plasticity is somewhat reflected by 
the population structure of E. coli, which is characterised
by the presence of distinct phylogenetic groups (or
"phylogroups") observable by phylogenetic reconstruction [14]
or the use of specific markers [15]. Four major (A, B1,
B2 and D) and two minor (E and F) phylogroups have so
far been described [14,16]. Judging from the non-random
isolation frequencies of different phylogroups in various
hosts and environments [8,9,17], it seems that the fitness 
in different environments varies among E. coli isolates 
from different phylogroups, which raises the question of 
the evolutionary nature of these phylogroups. Are they the 
present reflection of E. coli subgroups undergoing speci- 
ation as a consequence of slightly variable ecologies? Or, 
the primary environment of any E. coli being the gastroin- 
testinal tract of endotherms, is there a relative cohesion of 
these phylogroups within the E. coli species after all? An 
indirect but efficient method to answer these questions is 
to look at the patterns of recombination (homologous and 
non-homologous) between different strains and members
of the different phylogroups. As mentioned above,
recombination should be conditioned by existing ecological
differences between lineages, and may even be partly
responsible for them, in which case this approach also has
the potential to identify the genes that play a key role in
the adaptation.

In this study, we contribute to the understanding of 
the association between genomic evolution and ecological 
adaptation by presenting bioinformatic analyses of recombination
events (gene gain/loss and homologous recombination)
between 27 publicly available genomes of E. coli
from different phylogroups (A, B1, B2 and E) and ecological
backgrounds (commensal and different pathotypes).
More generally, our extensive knowledge about E. coli 
compared to other microbial species provides a unique 
opportunity to study the mechanisms of genomic evolu- 
tion in its biological context. We used a genomic analyt- 
ical pipeline (summarized in Figure 1) which combined 
progressiveMauve [18] for aligning the genomes, Clonal- 
Frame [19] to establish their clonal relationships with one 
another, GenoPlast [20] to study non-homologous recom- 
bination and ClonalOrigin [21] to examine homologous 
recombination. 

Methods 

Genome sequences 

A total of 30 genomes of E. coli were available from the 
NCBI reference sequence database [22] when this study 
was initiated. Three of these genomes (UMN026 [23], 
IAI39 [23] and SMS-3-5 [24]) were described as mem- 
bers of phylogroup D but did not cluster together in 
our preliminary phylogenetic analysis (Additional file 1: 
Figure S1). Furthermore, these three genomes showed
evidence of deviation in the molecular clock rate 
which could have confused the analyses presented here 
since the models in ClonalFrame [19] and ClonalOrigin 
[21] assume a constant clock rate (Additional file 1: 
Figure S1). These three genomes were therefore excluded
so that we were left with a set of 27 genomes, which is
summarized in Table 1. Several more genomes have
recently become available on NCBI, but the complex
analytical pipeline we used (Figure 1) could not easily
accommodate them.

Figure 1 Genomic analytical pipeline used in this study. The genomes are first aligned using Mauve, and the core-genome is used to estimate
the clonal genealogy using ClonalFrame. Non-core regions are then interpreted in terms of non-homologous recombination events on the
branches of the clonal genealogy using GenoPlast, whereas core regions are analyzed using ClonalOrigin to infer homologous recombination
events and their origins on the clonal genealogy.

Multi-locus sequence typing data 

To assess the representativeness of the 27 strains included 
in this study, we compared them with the isolates from the 
E. coli reference collection (ECOR) [44] which have been
characterized by two independent Multi-Locus Sequence
Typing [45,46] schemes. Fragments of 450-550bp from
seven housekeeping genes (adk, fumC, gyrB, icd, mdh,
recA and purA) have been sequenced previously for a
total concatenated length of 3423bp [47]. Additional fragments
of 450-600bp from eight genes (dinB, icdA, pabB,
polB, putP, trpA, trpB and uidA) have subsequently been
sequenced for a total concatenated length of 4095bp [16].
To achieve maximum robustness, we combined the data
from both studies to obtain 7518bp of sequence from each
isolate. BLAST [48] was used to extract the sequences 
of each of the 15 gene fragments from each of the 27 
genomes. A UPGMA dendrogram was then constructed 
to illustrate the phylogenetic relationship between the 
genomes and the ECOR collection (Figure 2). 
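
As a rough illustration of the dendrogram step, the sketch below builds a UPGMA tree from aligned, concatenated sequences. It is a minimal sketch under stated assumptions: the text does not say which distance measure or software was used, so the simple proportion of differing sites (`p_distance`), the nested-tuple tree representation, and the four toy sequences are all hypothetical choices made for the example.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(names, dist):
    """Build a nested-tuple tree by average linkage (UPGMA).

    names: list of leaf labels
    dist:  dict mapping frozenset({label_i, label_j}) -> distance
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = clusters.pop(a), clusters.pop(b)
        # Size-weighted average distance to every remaining cluster.
        for c in clusters:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))

# Toy aligned sequences (invented for illustration, not real MLST data).
seqs = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TCGAACTA",
    "s4": "TCGAAGTT",
}
dist = {
    frozenset((i, j)): p_distance(seqs[i], seqs[j])
    for i, j in combinations(seqs, 2)
}
tree = upgma(list(seqs), dist)
```

In this toy case the two most similar pairs are joined first and then merged at the root. With real data the same function would be applied to the 7518bp concatenates of the genomes and the ECOR isolates.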

Analysis of genomic content 

The genomes of the 27 strains in Table 1 were aligned 
using progressiveMauve [18,49,50]. progressiveMauve 
does not use annotations to guide the alignment. Conse- 
quently, when there are multiple copies of a gene in the 
genome, progressiveMauve will usually align the copy that 
fits best in the context of surrounding sequence, unless 
the identity to a sequence in a different context scores so 
much better that it exceeds the breakpoint penalty. In gen- 
eral this will have the effect of aligning orthologous copies 
of genes unless the gene conversion rate among paralogs is
very high.

Figure 2 Representativeness of the genomes used. Phylogenetic relationships between the 27 genomes in this study (labels in red) and the
ECOR reference collection (labels in black). Colors correspond to clade designations as follows: clade A in red, B1 in green, B2 in yellow, E in blue, D in
cyan and F in mauve.

The resulting alignment contained 2675 locally
colinear blocks (LCBs). For all subsets of the genomes with 
cardinality ranging from 1 to 27, the concatenated size of 
the homologous regions found in all or a fraction of the 
subset was counted directly from the output of progres- 
siveMauve. These values were used to generate Figure 3. 
Furthermore, for each pair of strains, a pairwise distance 
was computed representing the proportion of genome 
content that they have in common. This matrix of pair- 
wise distances was then used to build the UPGMA tree in 
Figure 4B. The cophenetic correlation coefficient [51] for 
this tree was 0.89 indicating that it is a fairly good repre- 
sentation of the differences in genomic content between 
the genomes. 
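
The quantities described in this paragraph can be sketched as follows. This is a hedged illustration only: the representation of the alignment as a list of `(length, set of genomes)` pairs, the function names, and the toy values are assumptions, and the Jaccard-style distance is one plausible reading of "proportion of genome content in common", not necessarily the exact formula used.

```python
from itertools import combinations

def core_and_pan(regions, genomes):
    """Concatenated length found in all (core) and in at least one (pan)
    of the chosen genomes.

    regions: list of (length_bp, set of genomes containing the region)
    """
    chosen = set(genomes)
    core = sum(length for length, present in regions if chosen <= present)
    pan = sum(length for length, present in regions if chosen & present)
    return core, pan

def mean_sizes(regions, genomes, k):
    """Average core and pan sizes over all subsets of cardinality k."""
    sizes = [core_and_pan(regions, s) for s in combinations(genomes, k)]
    n = len(sizes)
    return sum(c for c, _ in sizes) / n, sum(p for _, p in sizes) / n

def shared_content_distance(regions, g1, g2):
    """1 - (length present in both) / (length present in either)."""
    both = sum(l for l, p in regions if g1 in p and g2 in p)
    either = sum(l for l, p in regions if g1 in p or g2 in p)
    return 1 - both / either

# Toy presence/absence data for three hypothetical genomes.
regions = [
    (3000, {"g1", "g2", "g3"}),
    (500, {"g1", "g2"}),
    (200, {"g3"}),
    (300, {"g1"}),
]
genomes = ["g1", "g2", "g3"]
core_all, pan_all = core_and_pan(regions, genomes)
d13 = shared_content_distance(regions, "g1", "g3")
```

Plotting `mean_sizes` against k would give curves of the kind shown in Figure 3, and the matrix of `shared_content_distance` values is the sort of input a UPGMA tree of genomic content needs.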

From the complete alignment of the 27 genomes, a 
matrix of feature presence/absence was computed using 
the bbFilter script distributed with Mauve, where each 
feature represented 50bp of unique sequence. This data 
was analyzed using GenoPlast [20] which infers how the 
genomic composition of the genomes evolved on the 
branches of the clonal genealogy (computed as explained
in the next section) assuming a model in which gain
and loss of genetic material follow a relaxed molecular 
clock [52]. Briefly, GenoPlast explores the space of gain 
and loss events happening on branches that are compat- 
ible with the observed patterns of sharing of genomic 
regions observed in the genomes at the leaves of the 
tree. GenoPlast was run for 2,000,000 iterations with the 
first half discarded as burn-in. Good convergence and 
mixing properties were found by comparing different 
runs. The results of the GenoPlast analysis are shown in 
Figure 5. 

Reconstruction of clonal genealogy 

All regions of at least 500bp found in all 27 genomes 
were extracted from the progressiveMauve output 
using the stripSubsetLCBs script distributed with 
Mauve. A total of 765 such regions were found, rang- 
ing in size from 501bp to 27,115bp with a mean 
of 4322bp and a concatenated length of 3.3Mbp. 
These regions found in the 27 genomes represent the 
core-genome of E. coli (Figure 3).

Figure 3 Core and pan genome cumulative plot. Concatenated length of the regions found in all (red) and at least one (blue) genome as more
and more of the 27 genomes are aligned together.

We applied ClonalFrame
[19] to this core-genome in order to reconstruct the
clonal relationships between the genomes. ClonalFrame 
is a Bayesian phylogenetic method which performs infer- 
ence under an evolutionary model accounting for the 
effect of homologous recombination [19,53,54]. Five runs 
of ClonalFrame were performed independently each con- 
sisting of 100,000 iterations, the first half of which was 
discarded as burn-in. The results were compared between
runs and found to be highly similar, indicating good
convergence and mixing properties. The clonal geneal- 
ogy inferred by ClonalFrame is shown in Figure 4A. The 
analyses of homologous and non-homologous recombi- 
nation described below were performed conditionally on 
this clonal genealogy. Consequently, the fact that some 
genomes are more closely related to one another than 
others is fully accounted for in these analyses. 
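
The burn-in and between-run comparison described above can be sketched as follows. The text does not state which diagnostic was used to compare runs, so the Gelman-Rubin potential scale reduction factor below is a stand-in chosen for illustration, and the two traces are toy values rather than ClonalFrame output.

```python
from statistics import mean, variance

def discard_burn_in(chain, fraction=0.5):
    """Drop the first `fraction` of the samples (the first half by default)."""
    return chain[int(len(chain) * fraction):]

def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains (needs >= 2)."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])  # W: mean within-chain variance
    between = n * variance(chain_means)           # B: between-chain variance
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Toy traces of one scalar parameter from two hypothetical runs.
run_a = [9.0, 7.0, 5.0, 3.0, 1.0, 2.0, 3.0, 4.0]
run_b = [0.0, 0.5, 1.0, 1.5, 1.0, 2.0, 3.0, 4.0]
kept = [discard_burn_in(run) for run in (run_a, run_b)]
r_hat = gelman_rubin(kept)
```

Values of `r_hat` close to 1 are usually read as agreement between runs, while values well above 1 suggest that the runs have not converged to the same distribution.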



Figure 4 Genealogies based on homology and gene content. (A) ClonalFrame result based on core-genome. (B) UPGMA dendrogram based on
similarity of genomic content. Colors correspond to clade designations as follows: clade A in red, B1 in green, B2 in yellow and E in blue.
Analysis of homologous recombination

In order to further analyse the role played by homologous 
recombination during the diversification of E. coli from 
a common ancestor, we applied the computer software 
ClonalOrigin [21] which performs approximate inference 
under the coalescent with gene-conversion model [55,56]. 
ClonalOrigin detects recombination events, including 
their origin and destination on the clonal genealogy, and 
can therefore be used to reconstruct trends and patterns 
of homologous recombination [21,57]. The ClonalOrigin 
model rests on three global parameters: the average
length of recombination events $\delta$, and the scaled rates
of mutation and recombination events, respectively equal
to $\theta_s = 2N_e\mu$ and $\rho_s = 2N_e r$, where $N_e$ is the effective
population size, $\mu$ is the per site per generation mutation
frequency and $r$ is the per site per generation recombination
frequency. A first run of ClonalOrigin was performed
for each of the 765 core regions, with the three parameters
inferred independently for each region (this phase is called
"Step 2" in [21]). The median values of the three parameters
across all regions were as follows: $\delta = 542$ bp, $\theta_s = 0.0125$
and $\rho_s = 0.0128$. ClonalOrigin was then rerun for each
region with the three parameters set equal to these esti- 
mates (this phase is called "Step 3" in [21]). In both steps, 
ClonalOrigin was run for 2,000,000 iterations, the first half 
of which was discarded as burn-in. 
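
The passage from Step 2 to Step 3 amounts to taking a per-parameter median across regions. A minimal sketch follows, assuming the per-region estimates are available as dictionaries. The key names and the three toy regions are hypothetical, with values picked so that the medians match the reported ones.

```python
from statistics import median

PARAMETERS = ("delta", "theta_s", "rho_s")

def global_parameters(per_region):
    """Median of each parameter across regions.

    per_region: list of dicts with keys 'delta', 'theta_s' and 'rho_s'
    """
    return {k: median([r[k] for r in per_region]) for k in PARAMETERS}

# Toy Step 2 estimates for three hypothetical regions (the study used 765).
step2 = [
    {"delta": 400, "theta_s": 0.0100, "rho_s": 0.0128},
    {"delta": 542, "theta_s": 0.0125, "rho_s": 0.0050},
    {"delta": 900, "theta_s": 0.0200, "rho_s": 0.0300},
]
fixed = global_parameters(step2)

# Relative rate of recombination and mutation events.
ratio = fixed["rho_s"] / fixed["theta_s"]
```

The resulting `fixed` values would then be supplied unchanged to every region in the second round of runs.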



Didelot et al. BMC Genomics 2012, 13:256
http://www.biomedcentral.com/1471-2164/13/256



Step 2 was only used to infer the values of the three 
global parameters, and all results presented here are based 
on the Step 3 results from ClonalOrigin. For instance, 
Figure 6 represents the number of recombination boundaries
found in each of the 765 regions, with three hotspots
(defined as contiguous regions of the genome in which 
the average recombination rate across alignment blocks is 
significantly higher than elsewhere in the genome) high- 
lighted in grey. Figures 7 and 8 compare the number of 
inferred recombination events between different parts of 
the genealogy with the number expected under the prior 
model. These two figures are based on the numbers of 
the observed and expected recombination events com- 
puted by ClonalOrigin for all pairs of potential donor and 
recipient branches of the clonal genealogy. These values are
compiled in Additional file 2: Table S1.
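
The comparison of observed and expected counts for each donor/recipient pair can be sketched as below. The masking rule, under which pairs whose observed and expected counts are both at most three are not interpreted, follows the caption of Figure 8. The function name, the table layout and the counts are hypothetical.

```python
def flux_ratio(observed, expected, min_events=3):
    """Observed/expected number of events for one donor/recipient pair.

    Returns None when both counts are <= min_events, i.e. too low for the
    comparison to be meaningful.
    """
    if observed <= min_events and expected <= min_events:
        return None
    if expected == 0:
        return float("inf")
    return observed / expected

# Toy counts: (donor branch, recipient branch) -> (observed, expected).
counts = {
    ("x", "y"): (10, 5.0),
    ("y", "x"): (2, 3.0),
    ("x", "z"): (1, 8.0),
}
ratios = {pair: flux_ratio(o, e) for pair, (o, e) in counts.items()}
```

Ratios above 1 correspond to more recombination than expected under the prior model, and ratios below 1 to less.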

Results and discussion 

Representativeness of the strains used in this study 

This study included 27 previously sequenced genomes of 
Escherichia coli (Table 1). To assess how representative 
these genomes are of the global diversity of the species, we 
compared them to the Escherichia coli reference collection 
(ECOR) [44] on the basis of two Multi-Locus Sequence 
Typing schemes which together spanned a total of 15 
genes [16,47]. The resulting phylogeny (Figure 2) highlighted
the six previously described lineages of E. coli,
designated A, B1, B2, E, D and F [14,16]. Overall, the 27
strains covered much of the diversity of E. coli, with eight
strains in clade A, seven in clade B1, five in clade E and
seven in clade B2 (Figure 2; Table 1). In each of these four
clades, the strains seem to represent much of the within- 
clade diversity rather than being closely related within 
the clade. However, two clades were not represented in 
this genomic panel: clade D and clade F. Three genomes 
from these phylogroups (IAI39 [23], SMS-3-5 [24] and 
UMN026 [23]) were initially intended to be included, but 
were removed because they showed evidence for significant
deviation from the assumption of a fixed molecular
clock (Additional file 1: Figure S1). Figure 2 indicates how
the diversity of the genomes in this study relates to that
of the ECOR strains; however, it should be noted that the
issue of biased sampling of bacterial isolates is frequent
and it is never possible to be sure of the representativeness
of a sample [4,58].

Reconstruction of the clonal genealogy 

Aligning the 27 genomes using progressiveMauve [18,49, 
50] allowed us to compare their genomic content. As more
genomes were considered in the analysis, the cumulative
size of genomic regions shared by them decreased down
to about 3.3Mbp, or about two thirds of each genome
(Figure 3).

Figure 6 Intensity of recombination along the genome. Scatter plot where each cross represents a genomic region found in all 27 genomes.
The X-axis indicates the position of the region in the reference genome K-12/MG1655 [30] and the Y-axis is a measure of the intensity of
recombination inferred by ClonalOrigin. Three hotspots of recombination are highlighted in grey.

Since this length is roughly constant whether
10, 15, 20 or all 27 genomes are aligned (Figure 3), these
regions are likely to represent the core-genome of E. coli,
which means that homologs of these regions would be 
found in virtually any sequenced genome. We found 765 
core regions present in the genomes of all 27 strains, 
with total length 3,306,899bp. These core regions were 
input into ClonalFrame [19] in order to estimate the 
clonal genealogy in a way that accounts for homologous 
recombination which can confuse the signal of clonal 
inheritance [59]. This aspect is important because recombination
has been reported to be frequent in E. coli by
a large number of previous studies [14,16,47,60,61]. The
inferred clonal genealogy (Figure 4A) consisted of four
clades corresponding to A, B1, B2 and E. The relationships
between genomes within clades were fully resolved, which
is typically not achievable with MLST (e.g. [14,16]). The
relationships between clades were unbalanced, with clades
A and B1 most closely related to each other, and clade
B2 most distant from any other clade. The stemminess
(i.e. the ratio of internal to external branch lengths) of this
tree was compatible with expectation under the standard
coalescent model (Additional file 3: Figure S2), suggesting
no evidence for population size variation during the
evolution of E. coli [62-64].
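
To make the stemminess statistic concrete, the sketch below computes the ratio of summed internal to summed external branch lengths on a small tree. The `(branch_length, children)` tuple representation and the toy tree are assumptions made for the example. Comparing the value with its expectation under the coalescent, as done in the study, would additionally require simulating genealogies, which is not shown here.

```python
def branch_sums(node, is_root=True):
    """Return (internal, external) summed branch lengths for a subtree.

    node: (length of branch above node, list of child nodes); a leaf has an
    empty child list. The length stored at the root is ignored.
    """
    length, children = node
    internal = external = 0.0
    if not is_root:
        if children:
            internal += length
        else:
            external += length
    for child in children:
        i, e = branch_sums(child, is_root=False)
        internal += i
        external += e
    return internal, external

def stemminess(tree):
    """Ratio of total internal to total external branch length."""
    internal, external = branch_sums(tree)
    return internal / external

# Toy tree: one internal branch of length 2 leading to two leaves of
# length 1, plus a third leaf of length 3 attached to the root.
toy_tree = (0.0, [(2.0, [(1.0, []), (1.0, [])]), (3.0, [])])
value = stemminess(toy_tree)
```

For the toy tree the internal length is 2 and the external length is 5, so the ratio is 0.4.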

Analysis of the dispensable genome 

In contrast to the core regions described above, non-core 
regions are found only in a strict subset of the genomes. 
The set of non-core regions is called the dispensable
genome and together with the core genome forms
the pan-genome [65-67]. The cumulative length of the 
non-core regions continues to increase up to the 27th 
genome, showing no sign of flattening, with each new 
genome adding about 250Kbp of previously unobserved 
sequence (Figure 3). This distribution has been observed 
before, including in E. coli, and its pan-genome has con- 
sequently been called "open" [23,65-67]. However, an 
important difference between these previous studies and 
ours is that in Figure 3 the lengths of genomic material 
are measured directly whereas previous studies counted 
the number of genes. Our analysis is therefore robust to 
the problem of identifying homologous families of genes. 
Nevertheless, this result indicates that the pan-genome 
of species with a high diversity and ecological plasticity
such as E. coli draws from a large repertoire of genes
that can be gained and lost through lateral gene transfer 
[5,67]. 

The similarity of the genomes in terms of genomic con- 
tent was calculated from the patterns of presence and 
absence of non-core regions (Figure 4B). Compared with 
the clonal genealogy (Figure 4A), the clade structure is 
only partly preserved in this tree of genome content, with 
clades B2 and E intact but clades A and B1 intermingled.
Clade B1 was split into three parts which were perfectly
congruent with pathotypes. The three EHEC strains
12009, 11368 and 11128 [33] formed one separate cluster.

Figure 8 Detailed flux of recombination. Heat map showing the number of recombination events inferred by ClonalOrigin relative to its
expectation under the prior model for each donor/recipient pair of branches. Cells for which both the number of observed and expected events are
less than or equal to three are shown in light gray.

The two commensal strains IAI1 [23] and SE11 [32] and
the only ETEC strain E24377A [26] constituted another
separate cluster, in which the two commensal strains were 
closest to each other. Finally, the EAEC strain 55989 [23] 
was on a separate branch in spite of its close relationship 
with the commensal strains IAI1 and SE11 in the clonal
genealogy (Figure 4A). This subdivision of B1 in terms
of genomic content has been partially hinted at before 
[68] and the fact that it is congruent with pathotypes sug- 
gests that it is linked with differences in ecological and 
pathogenic lifestyles. The presence or absence of genomic 
regions in the 27 observed genomes is the result of a pro- 
cess of gain and loss of content by the ancestors of the 
genomes since their evolution from a common ancestor. 
If gain and loss happened randomly and at constant rates, 
the tree based on genomic content (Figure 4B) would be
expected to be very similar to the tree based on homology
of the core-genome (Figure 4A) since the evolution
of both core and pan genomes would then follow the 
same molecular clock. The two trees were however highly 
different, indicating that the non-homologous recombi- 
nation process (gain and loss of regions) did not follow 
a strict molecular clock. GenoPlast [20] was used to 
infer the non-homologous recombination events that hap- 
pened in the context of the clonal genealogy inferred by 
ClonalFrame (Figure 4A) under a model where the rates 
of gain and loss are allowed to change. The results of 
the GenoPlast analysis are shown in Figure 5, with differ- 
ences in the rates of gain and loss on specific branches 
spanning two orders of magnitude. The rates of gain and 
(to a lesser extent) loss of genomic material were found 
to be higher on the short recent branches within clades
A, E and B2 than on older and longer branches, which 
explained the higher stemminess of the genomic con- 
tent tree (Figure 4B) compared with the clonal genealogy 
(Figure 4A). 

The branch directly above EHEC strain 12009 had the 
largest amount of gain of any branch (1405 Kbp) whereas 
the branch above the common ancestor of the other two
B1 EHEC strains 11368 and 11128 had the highest amount
of gain for an internal branch (788 Kbp; with the exception
of the very long branch above clade E). Amongst the
genomic material gained on these two branches, 265 Kbp 
were shared by the three genomes, which explained why 
they clustered together in Figure 4B. The distribution of 
this gain on the three genomes (Additional file 4: Figure 
S3) indicated that their convergence in genomic content 
happened as a result of multiple gain events that happened 
both on the branch above 12009 and on the branch above 
the common ancestor of 11368 and 11128. The convergence
in genomic content of the three EHEC B1 strains
was therefore reciprocal rather than unidirectional. Few 
convergence events were found on the branches directly 
above 11368 and 11128 (Additional file 4: Figure S3) in 
spite of considerable gain on these branches (1009Kbp 
and 645Kbp respectively), which could indicate that the 
convergence in gene content with 12009 is not on-going. 
Unsurprisingly, this convergence involved several genes 
known to be EHEC determinants, including Shiga toxins 
[69] and all genes from the locus of enterocyte efface- 
ment (or LEE [70]). However, it also included additional 
genes, such as flagellar genes (fli [71]) and a few metabolic 
clusters (frl [72] and gal [73]) with a notable presence of
genes involved in aromatic compounds metabolism (hpa,
hpc, mhp and mhp [74]). These genes were not present
in the other B1 strains examined in this study, which may
indicate that acquiring EHEC determinants via LGT is an
important means of E. coli adaptation, possibly enhanced
by the differences in host-associated selective pressures on 
EHEC compared to commensals or more opportunistic 
pathotypes. 

Homologous recombination hotspots in Escherichia coli 

To quantify the propensity, genomic distribution and 
directionality of homologous recombination during the 
evolution of E. coli, we applied ClonalOrigin [21] to
the 765 core regions, assuming the clonal relationships
between genomes estimated by ClonalFrame [19]
in Figure 4A. The average length of fragments involved
in homologous recombination was estimated at $\delta = 542$ bp.
This is almost ten times higher than a previous estimate
in E. coli [23], but is of the same order as recent
whole-genome estimates in Bacillus cereus [21], Helicobacter
pylori [75] or Chlamydia trachomatis [76]. The
relative rate of occurrence of recombination and mutation
[77] was estimated at $\rho_s/\theta_s = 0.0128/0.0125 = 1.024$
which means that overall recombination happened just 
as frequently as mutation. The estimated rate of homol- 
ogous recombination was fairly constant throughout the 
genome (Figure 6), with the exception of three clear 
hotspots (highlighted in grey) in which recombination 
rates were significantly higher. This included two large 
regions around the rfb operon involved in synthesis of
the O antigen (positions 2,020 to 2,190 Kbp in the ref- 
erence genome K-12/MG1655 [30]) and around the fimA 
gene (positions 4,420 to 4,620 Kbp). These two regions 
had been reported previously as hotspots of diversity and 
recombination [23,78]. 

A smaller recombination hotspot was also detected, 
made of just two nearly adjacent core regions (between 
positions 2,442 and 2,447 Kbp). This region had a sim- 
ilarly high recombination rate as the two regions above, 
but had not previously been detected as a hotspot, per- 
haps because of its small size (around 5 Kbp). This hotspot 
contained genes yfcL, yfcM, yfcA, mepA, aroC, prmB and
smrB. The gene mepA encodes a murein endopeptidase
[79] whose role is presumably to restructure the
bacterial cell wall during elongation or stabilise the pep- 
tidoglycan. Mutational analyses on mepA [79,80] do not 
provide enough information to explain why recombina- 
tion should be high for this gene. In the bacterial cell, 
aroC governs the synthesis of chorismate, a key precur- 
sor to the biosynthesis of aromatic compounds including 
the amino acids tryptophan and phenylalanine but also 
the siderophore enterobactin. The positive maintenance 
of a functional allele of aroC is arguably crucial for the cell 
to maintain appropriate levels of these amino acids and 
siderophores in natural conditions. In Salmonella [81], as 
well as in Brucella suis [82], aroC is required for viru- 
lence. Incidentally, aroC is a common target to produce 
knocked-out attenuated vaccine strains [83], for instance 
in Salmonella serovars Typhi [84-86] and Typhimurium 
[81,85,87,88], pathogenic E. coli [89], Brucella suis [82],
Burkholderia pseudomallei [90] or Edwardsiella tarda 
[91]. To our knowledge, this is the first mention of aroC 
being part of a recombination hotspot, giving additional 
clues on evolutionary dynamics at this locus. Depending 
on how aroC is involved with virulence in E. coli, it may be
under selective pressure from the immune system of the 
host, which could explain the observed peak in recombi- 
nation rate [4], but this hypothesis will need further work 
to be fully assessed. 

Flux of homologous recombination 

The numbers of recombination events inferred by Clon- 
alOrigin were counted for every combination of clades 
receiving and donating, and these values were compared 
with their expectation under the ClonalOrigin model 
which represents a close approximation to the coalescent 
model with gene conversion [21,55,56]. This comparison
revealed significant non-uniformity in the homologous 
recombination flux within and between clades (Figure 7). 
The three clades A, B1 and B2 had higher numbers
of within-clade recombination events than expected, whereas
clade E had almost exactly the expected number. On
the other hand, the number of recombination events
detected between clades was almost systematically below
its expected value, with the only exception being recombination
from clade A to B1 and vice versa, which had
slightly higher than expected values. Clades A and B1
are the two most closely related phylogroups (Figure 4A),
which may help to explain this observation. Overall,
inter-phylogroup recombination fluxes were lower
than intra-phylogroup ones, which is compatible with the
hypothesis that there is a preferred way of gene sharing
within phylogroups [92]. This preferred exchange among
strains of the same phylogroup could be explained by
the possibility that the different E. coli phylogroups have
only partially overlapping ecologies, which makes the
likelihood of gene transfer higher within phylogroups than
between them.
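
The clade-level comparison can be thought of as pooling branch-level counts by the clades of the donor and recipient. The sketch below illustrates this under simplifying assumptions: every branch is assigned to exactly one clade, the branch labels and counts are invented, and branches ancestral to several clades are ignored.

```python
from collections import defaultdict

def clade_flux(events, clade_of):
    """Pool observed and expected counts by (donor clade, recipient clade).

    events:   {(donor_branch, recipient_branch): (observed, expected)}
    clade_of: {branch: clade label}
    Returns {(donor clade, recipient clade): observed / expected}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for (donor, recipient), (obs, exp) in events.items():
        key = (clade_of[donor], clade_of[recipient])
        totals[key][0] += obs
        totals[key][1] += exp
    return {key: obs / exp for key, (obs, exp) in totals.items() if exp > 0}

# Toy example with two hypothetical clades, P and Q, of two branches each.
clade_of = {"p1": "P", "p2": "P", "q1": "Q", "q2": "Q"}
events = {
    ("p1", "p2"): (12, 8.0),
    ("p2", "p1"): (12, 8.0),
    ("q1", "q2"): (9, 6.0),
    ("p1", "q1"): (2, 5.0),
    ("q2", "p2"): (3, 5.0),
}
flux = clade_flux(events, clade_of)
```

In this toy table the within-clade ratios exceed 1 and the between-clade ratios fall below 1, which is the qualitative pattern reported for most pairs of phylogroups.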

An analysis similar to the one above was performed on a
branch-by-branch basis rather than a clade-by-clade basis
(Figure 8), the only added difficulty being that some 
donor/recipient pairs of branches have too low num- 
bers of expected and observed recombination events for 
the comparison to be meaningful (represented in grey 
in Figure 8). This analysis confirmed the general pattern 
described above, with more recombination than expected 
within-clades and between A and Bl, and less recombina- 
tion between all other clades. However, it also allowed the 
comparison of the individual behaviours of strains belong- 
ing to the same clade. For instance, strains BL21 and 
REL606 [27] showed little history of importing recombi- 
nation from clade Bl, contrasting with ATCC8739 [25] 
or HS [26] even though all four strains belong to clade 
A. This may be explained by the fact that these two 
strains are laboratory-adapted derived from E, coli strain 
B [27,93], so that they would have had little or no oppor- 
tunity for recent encounter and recombination with Bl 
strains.The four K-12 strains in this study [28-31] were 
also laboratory-adapted, but had terminal branches too 
small to reliably estimate deviations in the number of 
recombination events. These four strains all originated 
from bioengineering manipulation on K-12 lineages over 
the last century and therefore harbour a very limited num- 
ber of differences between them compared to what would 
be observed in natural populations. 
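The masking of uninformative donor/recipient pairs could be handled as in the sketch below. The cut-off `MIN_COUNT` is a hypothetical value chosen for illustration, since the text does not state the threshold that was used, and `classify_cell` and `classify_matrix` are likewise names introduced here rather than part of any published tool.

```python
MIN_COUNT = 5  # hypothetical cut-off, not a value stated in the text


def classify_cell(observed, expected, min_count=MIN_COUNT):
    """Label one donor/recipient cell as 'grey', 'above', 'below' or 'equal'."""
    if observed < min_count and expected < min_count:
        # Too few events, observed and expected, for a meaningful comparison.
        return "grey"
    if observed > expected:
        return "above"
    if observed < expected:
        return "below"
    return "equal"


def classify_matrix(cells, min_count=MIN_COUNT):
    """Apply classify_cell to a dict keyed by (donor branch, recipient branch)."""
    return {
        pair: classify_cell(observed, expected, min_count)
        for pair, (observed, expected) in cells.items()
    }


if __name__ == "__main__":
    # Invented branch labels and counts, for illustration only.
    toy_cells = {
        ("branch_1", "branch_2"): (40, 22.5),
        ("branch_2", "branch_1"): (3, 11.0),
        ("branch_1", "branch_3"): (1, 0.8),
    }
    print(classify_matrix(toy_cells))
```

Requiring both the observed and the expected count to be low before masking a cell is one reading of the sentence above; a stricter rule that masks a cell when either count is low would be an equally defensible choice.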

The B1 strains 11128 and 11368 [33] showed significantly
less sign of import from clade A (and to a lesser extent
from B1) than other strains of B1. This observation implies
that these EHEC strains have stopped recombining with strains
of clade A (which are all commensals) as they adapted to
their new pathogenic lifestyle. Two of the highest values
throughout Figure 8 were those corresponding to imports from
strains 11128 and 11368 into strain 12009 [33]. As previously
noted, these are the only three EHEC strains in B1, and these
three genomes have been converging in genomic content due to
numerous non-homologous recombination events. This result
indicates that the three genomes also have an extensive
history of convergence through homologous recombination,
which may have occurred at the same time as the gain of new
shared genes. The evolutionary history of these three genomes
therefore seems analogous to that of Salmonella serovars
Typhi and Paratyphi A, for which both core and pan genomes
converged through recombination as they progressively adapted
to exclusive infection of the human host [6].

Speciation in E. coli

In the analysis of homologous recombination described
above, three groups corresponding respectively to
phylogroups E, B2, and A+B1 exhibited more recombination
within than between one another (Figures 7 and 8). This
pattern is compatible with a definition of speciation in
bacteria in which recombination plays the role of a cohesive
force counterbalancing divergence by genetic drift and
population structure, and where species appear when this
force is weakened between lineages [1,2,94]. Under such a
model, patterns of genetic diversity can be generated in
silico similar to those observed for example in Salmonella
enterica [95,96]. The three groups might therefore represent
lineages that, because of slightly distinct ecologies or
notable variations in the species life cycle, have gradually
diverged too far from one another for recombination to play
its cohesive role, so that they might eventually become
separate species, should these variations remain or increase.
In other words, all E. coli phylogenetic backgrounds are
found in the gut of endotherms [14], which is their primary
environment, and to some extent in non-host secondary
environments [8,9], but it seems plausible that
phylogroup-associated variations in ecological fitness in
different hosts or secondary environments could gradually
decrease the physical and ecological overlap of strains from
different phylogroups through time, and therefore the genetic
flux between them. A number of studies seem to support this
hypothesis, as different proportions of the different
phylogroups are found in different environments and hosts
[8,9,97,98]. Additionally, some phylogroups seem to harbour
strains that have either host-restricted or more generalist
lifestyles [99], as well as strains that are either resident
or transient in their ability to colonize the gut [100]. Our
study helps to highlight that such variations in ecology
could potentially have an impact on genetic exchange in
E. coli.



Didelot et al. BMC Genomics 2012, 13:256
http://www.biomedcentral.com/1471-2164/13/256



A number of additional factors can be invoked to explain
why the three groups would be diverging, including
differences in their geographic distribution, adaptive
selection, or simply the dependence of recombination on
homology between donor and recipient [4,94,96,101,102]. The
three groups are clearly separate in terms of genomic content
(Figure 4B), which could explain why they recombine less with
each other and why clades A and B1 still recombine
frequently, since they are not differentiated in terms of
genomic content. To test this hypothesis, we compared the
distribution of the number of recombination events found in
the middle and at the edge of core regions (Additional file
5: Figure S4). We found that recombination happened more
often in the middle of core regions, at a small but highly
significant level (Kolmogorov-Smirnov test; p-value =
8.8e-09). If the variable genomic content is not just the
result of a random process, then homologous recombination
would be expected to happen less often around the variable
genes, a concept sometimes called fragmented speciation or
"species in pieces" [103-106], as it predicts that speciation
can apply differentially across the genome. Our results
therefore demonstrate that fragmented speciation applies to
E. coli, and that difference in genomic content is at least
one of the factors driving the divergence of the three
lineages.
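The middle-versus-edge comparison can be illustrated with a small, self-contained Python sketch. It assumes that each recombination event is represented as an inclusive (start, end) interval and each core region as an inclusive (start, end) pair, which is a simplification introduced here; the function names are hypothetical, and the p-value uses a standard asymptotic approximation to the Kolmogorov distribution, which is only approximate for tied count data such as these. It is not the code used for the analysis.

```python
from bisect import bisect_right
from math import exp, sqrt


def count_covering(events, position):
    """Number of (start, end) events, inclusive, that cover a position."""
    return sum(1 for start, end in events if start <= position <= end)


def site_counts(region, events, offset=10):
    """Counts at the middle of a core region and at both near-edge sites."""
    start, end = region
    middle = count_covering(events, (start + end) // 2)
    edges = [count_covering(events, start + offset),
             count_covering(events, end - offset)]
    return middle, edges


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):  # the ECDFs only change at observed values
        gap = abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        d = max(d, gap)
    return d


def ks_pvalue(d, n_a, n_b, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov series."""
    en = sqrt(n_a * n_b / (n_a + n_b))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series converges slowly here and the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    # Invented intervals, for illustration only.
    toy_events = [(0, 100), (40, 60), (90, 200)]
    print(site_counts((0, 200), toy_events))
    d = ks_statistic([3, 4, 4, 5, 6], [1, 2, 2, 3, 4])
    print(d, ks_pvalue(d, 5, 5))
```

Applied to the real data, the two samples would be the per-region counts at the middle position and the counts at the sites 10 bp from each region boundary, as described for Additional file 5.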

Conclusions 

We applied a pipeline of statistical analyses in order to
compare the sequences of 27 E. coli genomes and reveal
the ancestral history of clonal relationships, homologous
recombination events and non-homologous recombination
events that have led the ancestor of E. coli to diversify
into the genomes we see today. The overall picture
was one of divergence between three lineages (A+B1, B2,
E) which were well differentiated on the basis of both
genomic content and preference for homologous
recombination, with the former apparently driving the latter
as expected under a fragmented speciation scenario. However,
against this background of divergence, we observed the
convergence of three EHEC strains within B1 in both their
core- and pan-genomes. These observations were correlated
with the diversity of ecology and pathogenicity of the
E. coli strains, and provide hypotheses for which genes and
evolutionary processes are adaptively important.



Additional files

Additional file 1: Figure S1. Test of molecular clock assumption.
Neighbour-joining phylogenetic reconstruction based on all 30 genomes
available from NCBI, which shows that three of them (UMN026, IAI39
and SMS-3-5) deviated significantly from the assumption of a
constant molecular clock.

Additional file 2: Table S1. Detailed results of the ClonalOrigin analysis.
This table contains all expected and observed values of the number of
recombination events for all pairs of donor and recipient branches, as
computed by ClonalOrigin. This is the data on which Figure 7 is based.
There is a cell for each donor/recipient combination, and the cells are
ordered vertically and horizontally in the same way as in . In each cell, two
values are given separated by a semi-colon: the first one is the observed
value and the second one is the expected value.

Additional file 3: Figure S2. Test of ancestral population size dynamics.
Distribution of expected values of stemminess under the coalescent
model. The observed value for the clonal genealogy estimated by
ClonalFrame is shown as a vertical line and falls within the expected values.

Additional file 4: Figure S3. Gain in the three genomes 12009, 11368
and 11128. The genomic regions gained by the three genomes 12009,
11368 and 11128 are colored. The regions in red are the ones that are
uniquely shared by the three genomes, whereas the regions in green are
not. For genome 12009, only the gain happening on the branch directly
above is shown. For genomes 11368 and 11128, the gain on the branches
directly above is shown using lighter green and red, and the gain that
happened on the branch above the common ancestor of 11368 and 11128
is shown using darker green and red.

Additional file 5: Figure S4. Test of the fragmented speciation model.
Boxplots of the distributions of the numbers of recombination events
found in the middle (left) and at the edge of core regions (right). To
generate the distribution on the left, the number of recombination events
affecting the middle position was counted for each of the 765 core
regions. To generate the distribution on the right, the number of
recombination events affecting the site 10bp after the beginning of each
core region was counted, as well as the number of recombination events
affecting the site 10bp before the end of each core region.


References 

1. Achtman M, Wagner M: Microbial diversity and the genetic nature of
microbial species. Nat Rev Microbiol 2008, 6:431-440.

2. Fraser C, Alm EJ, Polz MF, Spratt BG, Hanage WP: The bacterial species
challenge: making sense of genetic and ecological diversity.
Science 2009, 323(5915):741-746.

3. Sheppard S, McCarthy N, Falush D, Maiden M: Convergence of 
Campylobacter species: implications for bacterial evolution. 
Science 2008, 320(5873):237-239. 

4. Didelot X, Maiden MC: Impact of recombination on bacterial
evolution. Trends Microbiol 2010, 18:315-322.






5. Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the
nature of bacterial innovation. Nature 2000, 405(6784):299-304.

6. Didelot X, Achtman M, Parkhill J, Thomson N, Falush D: A bimodal
pattern of relatedness between the Salmonella Paratyphi A and
Typhi genomes: Convergence or divergence by homologous
recombination? Genome Res 2007, 17:61-68.

7. Croxen MA, Finlay BB: Molecular mechanisms of Escherichia coli
pathogenicity. Nat Rev Microbiol 2010, 8:26-38.

8. Walk S, Alm E, Calhoun L, Mladonicky J, Whittam T: Genetic diversity
and population structure of Escherichia coli isolated from
freshwater beaches. Environ Microbiol 2007, 9(9):2274-2288.

9. Bergholz PW, Noar JD, Buckley DH: Environmental patterns are
imposed on the population structure of Escherichia coli after fecal
deposition. Appl Environ Microbiol 2011, 77:211-219.

10. Ishii S, Ksoll WB, Hicks RE, Sadowsky MJ: Presence and growth of
naturalized Escherichia coli in temperate soils from Lake Superior
watersheds. Appl Environ Microbiol 2006, 72:612-621.

11. Texier S, Prigent-Combaret C, Gourdon MH, Poirier MA, Faivre P, Dorioz
JM, Poulenard J, Jocteur-Monrozier L, Moenne-Loccoz Y, Trevisan D:
Persistence of culturable Escherichia coli fecal contaminants in
dairy alpine grassland soils. J Environ Qual 2008, 37(6):2299-2310.

12. Brennan FP, Abram F, Chinalia FA, Richards KG, O'Flaherty V:
Characterization of environmentally persistent Escherichia coli
isolates leached from an Irish soil. Appl Environ Microbiol 2010,
76(7):2175-2180.

13. Brennan FP, O'Flaherty V, Kramers G, Grant J, Richards KG: Long-term
persistence and leaching of Escherichia coli in temperate maritime
soils. Appl Environ Microbiol 2010, 76(5):1449-1455.

14. Tenaillon O, Skurnik D, Picard B, Denamur E: The population genetics
of commensal Escherichia coli. Nat Rev Microbiol 2010, 8(3):207-217.

15. Clermont O, Bonacorsi S, Bingen E: Rapid and simple determination
of the Escherichia coli phylogenetic group. Appl Environ Microbiol
2000, 66(10):4555-4558.

16. Jaureguy F, Landraud L, Passet V, Diancourt L, Frapy E, Guigon G,
Carbonnelle E, Lortholary O, Clermont O, Denamur E, Picard B, Nassif X,
Brisse S: Phylogenetic and genomic diversity of human bacteremic
Escherichia coli strains. BMC Genomics 2008, 9:560-560.

17. Skurnik D, Bonnet D, Bernede-Bauduin C, Michel R, Guette C, Becker JM,
Balaire C, Chau F, Mohler J, Jarlier V, Boutin JP, Moreau B, Guillemot D,
Denamur E, Andremont A, Ruimy R: Characteristics of human
intestinal Escherichia coli with changing environments. Environ
Microbiol 2008, 10(8):2132-2137.

18. Darling A, Mau B, Perna N: progressiveMauve: Multiple genome
alignment with gene gain, loss and rearrangement. PLoS One 2010,
5(6):e11147.

19. Didelot X, Falush D: Inference of bacterial microevolution using
multilocus sequence data. Genetics 2007, 175(3):1251-1266.

20. Didelot X, Darling A, Falush D: Inferring genomic flux in bacteria.
Genome Res 2009, 19:306-317.

21. Didelot X, Lawson D, Darling A, Falush D: Inference of homologous
recombination in bacteria using whole-genome sequences.
Genetics 2010, 186(4):1435-1449.

22. Pruitt KD, Tatusova T, Klimke W, Maglott DR: NCBI Reference 
Sequences: current status, policy and new initiatives. Nucleic Acids 
Res 2009, 37(Database issue):32-36. 

23. Touchon M, Hoede C, Tenaillon O, Barbe V, Baeriswyl S, Bidet P, Bingen E,
Bonacorsi S, Bouchier C, Bouvet O, Calteau A, Chiapello H, Clermont O,
Cruveiller S, Danchin A, Diard M, Dossat C, Karoui ME, Frapy E, Garry L,
Ghigo JM, Gilles AM, Johnson J, Le Bouguenec C, Lescat M, Mangenot S,
Martinez-Jehanne V, Matic I, Nassif X, Oztas S, Petit MA, Pichon C, Rouy Z,
Ruf CS, Schneider D, Tourret J, Vacherie B, Vallenet D, Medigue C, Rocha
EP, Denamur E: Organised genome dynamics in the Escherichia coli
species results in highly diverse adaptive paths. PLoS Genet 2009,
5:e1000344.

24. Fricke WF, Wright MS, Lindell AH, Harkins DM, Baker-Austin C, Ravel J,
Stepanauskas R: Insights into the environmental resistance gene
pool from the genome sequence of the multidrug-resistant
environmental isolate Escherichia coli SMS-3-5. J Bacteriol 2008,
190(20):6779-6794.



25. Copeland A, Lucas S, Lapidus A, Glavina del Rio T, Dalin E, Tice H, Bruce
D, Goodwin L, Pitluck S, Kiss H, Brettin T, Detter J, Han C, Kuske C,
Schmutz J, Larimer F, Land M, Hauser L, Kyrpides N, Mikhailova N, Ingram
L, Richardson P: Complete sequence of Escherichia coli C str. ATCC
8739. [http://www.ncbi.nlm.nih.gov/nucleotide/169752989].

26. Rasko DA, Rosovitz MJ, Myers GS, Mongodin EF, Fricke WF, Gajer P,
Crabtree J, Sebaihia M, Thomson NR, Chaudhuri R, Henderson IR,
Sperandio V, Ravel J: The pangenome structure of Escherichia coli:
comparative genomic analysis of E. coli commensal and
pathogenic isolates. J Bacteriol 2008, 190(20):6881-6893.

27. Jeong H, Barbe V, Lee CH, Vallenet D, Yu DS, Choi SH, Couloux A, Lee SW,
Yoon SH, Cattolico L, Hur CG, Park HS, Segurens B, Kim SC, Oh TK, Lenski
RE, Studier FW, Daegelen P, Kim JF: Genome sequences of Escherichia
coli B strains REL606 and BL21(DE3). J Mol Biol 2009, 394(4):644-652.

28. Ferenci T, Zhou Z, Betteridge T, Ren Y, Liu Y, Feng L, Reeves PR, Wang L:
Genomic sequencing reveals regulatory mutations and
recombinational events in the widely used MC4100 lineage of
Escherichia coli K-12. J Bacteriol 2009, 191(12):4025-4029.

29. Durfee T, Nelson R, Baldwin S, Plunkett G, Burland V, Mau B, Petrosino
JF, Qin X, Muzny DM, Ayele M, Gibbs RA, Csorgo B, Posfai G, Weinstock
GM, Blattner FR: The complete genome sequence of Escherichia coli
DH10B: insights into the biology of a laboratory workhorse. J
Bacteriol 2008, 190(7):2597-2606.

30. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-
Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick
HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome
sequence of Escherichia coli K-12. Science 1997, 277(5331):1453-1474.

31. Riley M, Abe T, Arnaud MB, Berlyn MK, Blattner FR, Chaudhuri RR, Glasner
JD, Horiuchi T, Keseler IM, Kosuge T, Mori H, Perna NT, Plunkett G, Rudd
KE, Serres MH, Thomas GH, Thomson NR, Wishart D, Wanner BL:
Escherichia coli K-12: a cooperatively developed annotation
snapshot. Nucleic Acids Res 2006, 34:1-9.

32. Oshima K, Toh H, Ogura Y, Sasamoto H, Morita H, Park SH, Ooka T, Iyoda
S, Taylor TD, Hayashi T, Itoh K, Hattori M: Complete genome sequence
and comparative analysis of the wild-type commensal Escherichia
coli strain SE11 isolated from a healthy adult. DNA Res 2008,
15(6):375-386.

33. Ogura Y, Ooka T, Iguchi A, Toh H, Asadulghani M, Oshima K, Kodama T,
Abe H, Nakayama K, Kurokawa K, Tobe T, Hattori M, Hayashi T:
Comparative genomics reveal the mechanism of the parallel
evolution of O157 and non-O157 enterohemorrhagic Escherichia
coli. Proc Natl Acad Sci USA 2009, 106(42):17939-17944.

34. Eppinger M, Mammel MK, Leclerc JE, Ravel J, Cebula TA: Genomic
anatomy of Escherichia coli O157:H7 outbreaks. Proc Natl Acad Sci
USA 2011, 108(50):20142-20147.

35. Kulasekara BR, Jacobs M, Zhou Y, Wu Z, Sims E, Saenphimmachak C,
Rohmer L, Ritchie JM, Radey M, McKevitt M, Freeman TL, Hayden H,
Haugen E, Gillett W, Fong C, Chang J, Beskhlebnaya V, Waldor MK,
Samadpour M, Whittam TS, Kaul R, Brittnacher M, Miller SI: Analysis of
the genome of the Escherichia coli O157:H7 2006
spinach-associated outbreak isolate indicates candidate genes
that may enhance virulence. Infect Immun 2009, 77(9):3713-3721.

36. Perna NT, Plunkett G, Burland V, Mau B, Glasner JD, Rose DJ, Mayhew
GF, Evans PS, Gregor J, Kirkpatrick HA, Posfai G, Hackett J, Klink S, Boutin
A, Shao Y, Miller L, Grotbeck EJ, Davis WN, Lim A, Dimalanta ET,
Potamousis KD, Apodaca J, Anantharaman TS, Lin J, Yen G, Schwartz DC,
Welch RA, Blattner FR: Genome sequence of enterohaemorrhagic
Escherichia coli O157:H7. Nature 2001, 409(6819):529-533.

37. Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama K, Han CG,
Ohtsubo E, Nakayama K, Murata T, Tanaka M, Tobe T, Iida T, Takami H,
Honda T, Sasakawa C, Ogasawara N, Yasunaga T, Kuhara S, Shiba T,
Hattori M, Shinagawa H: Complete genome sequence of
enterohemorrhagic Escherichia coli O157:H7 and genomic
comparison with a laboratory strain K-12. DNA Res 2001, 8:11-22.

38. Zhou Z, Li X, Liu B, Beutin L, Xu J, Ren Y, Feng L, Lan R, Reeves PR, Wang
L: Derivation of Escherichia coli O157:H7 from its O55:H7
precursor. PLoS One 2010, 5:e8700.

39. Johnson TJ, Kariyawasam S, Wannemuehler Y, Mangiamele P, Johnson
SJ, Doetkott C, Skyberg JA, Lynne AM, Johnson JR, Nolan LK: The
genome sequence of avian pathogenic Escherichia coli strain
O1:K1:H7 shares strong similarities with human extraintestinal
pathogenic E. coli genomes. J Bacteriol 2007, 189(8):3228-3236.

40. Chen SLL, Hung CSS, Xu J, Reigstad CSS, Magrini V, Sabo A, Blasiar D, Bieri
T, Meyer RRR, Ozersky P, Armstrong JRR, Fulton RSS, Latreille JPP, Spieth
J, Hooton TMM, Mardis ERR, Hultgren SJJ, Gordon JII: Identification of
genes subject to positive selection in uropathogenic strains of
Escherichia coli: A comparative genomics approach. Proc Natl Acad
Sci USA 2006, 103(15):5977-5982.

41. Welch RA, Burland V, Plunkett G, Redford P, Roesch P, Rasko D, Buckles
EL, Liou SR, Boutin A, Hackett J, Stroud D, Mayhew GF, Rose DJ, Zhou S,
Schwartz DC, Perna NT, Mobley HL, Donnenberg MS, Blattner FR:
Extensive mosaic structure revealed by the complete genome
sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci USA
2002, 99(26):17020-17024.

42. Brzuszkiewicz E, Bruggemann H, Liesegang H, Emmerth M, Olschlager T,
Nagy G, Albermann K, Wagner C, Buchrieser C, Emody L, Gottschalk G,
Hacker J, Dobrindt U: How to become a uropathogen: comparative
genomic analysis of extraintestinal pathogenic Escherichia coli
strains. Proc Natl Acad Sci USA 2006, 103(34):12879-12884.

43. Iguchi A, Thomson NR, Ogura Y, Saunders D, Ooka T, Henderson IR, Harris
D, Asadulghani M, Kurokawa K, Dean P, Kenny B, Quail MA, Thurston S,
Dougan G, Hayashi T, Parkhill J, Frankel G: Complete genome sequence
and comparative genome analysis of enteropathogenic Escherichia
coli O127:H6 strain E2348/69. J Bacteriol 2009, 191:347-354.

44. Ochman H, Selander RK: Standard reference strains of Escherichia
coli from natural populations. J Bacteriol 1984, 157(2):690-693.

45. Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q,
Zhou J, Zurth K, Caugant DA, Feavers IM, Achtman M, Spratt BG:
Multilocus sequence typing: A portable approach to the
identification of clones within populations of pathogenic
microorganisms. PNAS 1998, 95(6):3140-3145.

46. Maiden MC: Multilocus sequence typing of bacteria. Annu Rev 
Microbiol 2006, 60:561-588. 

47. Wirth T, Falush D, Lan R, Colles F, Mensa P, Wieler LH, Karch H, Reeves PR,
Maiden MC, Ochman H, Achtman M: Sex and virulence in Escherichia
coli: an evolutionary perspective. Mol Microbiol 2006,
60(5):1136-1151.

48. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman
DJ: Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.

49. Darling AC, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment
of conserved genomic sequence with rearrangements. Genome Res
2004, 14(7):1394-1403.

50. Darling AE, Treangen TJ, Messeguer X, Perna NT: Analyzing patterns of
microbial evolution using the mauve genome alignment system.
Methods Mol Biol (Clifton, NJ) 2007, 396:135-152.

51. Sokal R, Rohlf F: The comparison of dendrograms by objective
methods. Taxon 1962, 11(2):33-40.

52. Huelsenbeck JP, Larget B, Swofford D: A compound Poisson process
for relaxing the molecular clock. Genetics 2000, 154(4):1879-1892.

53. Didelot X, Falush D: Bacterial Recombination in vivo. Horizontal Gene
Transfer in the Evolution of Pathogenesis: Cambridge University Press;
2008.

54. Didelot X: Sequence-based analysis of bacterial population structure.
Bacterial Population Genetics in Infectious Disease: Wiley Press; 2010.

55. Wiuf C, Hein J: The coalescent with gene conversion. Genetics 2000, 
155:451-462. 

56. Didelot X, Lawson D, Falush D: SimMLST: simulation of multi-locus
sequence typing data under a neutral model. Bioinformatics 2009,
25(11):1442-1444.

57. Cadillo-Quiroz H, Didelot X, Held NL, Herrera A, Darling A, Reno ML,
Krause DJ, Whitaker RJ: Patterns of gene flow define species of
thermophilic Archaea. PLoS Biol 2012, 10(2):e1001265.

58. Fraser C, Hanage WP, Spratt BG: Neutral microepidemic evolution of
bacterial pathogens. Proc Natl Acad Sci USA 2005, 102(6):1968-1973.

59. Schierup MH, Hein J: Consequences of recombination on traditional
phylogenetic analysis. Genetics 2000, 156(2):879-891.

60. Guttman DS, Dykhuizen DE: Clonal divergence in Escherichia coli as a
result of recombination, not mutation. Science 1994,
266(5189):1380-1383.

61. Vos M, Didelot X: A comparison of homologous recombination rates
in bacteria and archaea. ISME J 2009, 3(2):199-208.

62. Fiala KL, Sokal RR: Factors determining the accuracy of cladogram
estimation - evaluation using computer-simulation. Evolution 1985,
39:609-622.

63. den Bakker H, Didelot X, Fortes E, Nightingale K, Wiedmann M: Lineage
specific recombination rates and microevolution in Listeria
monocytogenes. BMC Evolutionary Biol 2008, 8:277.

64. Didelot X, Urwin R, Maiden MCJ, Falush D: Genealogical typing of
Neisseria meningitidis. Microbiology 2009, 155:3176-3186.

65. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL,
Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM,
Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram
JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan
SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri
H, Radune D, Dimitrov G, Watkins K, O'Connor KJ, Smith S, Utterback TR,
White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels
MR, Rappuoli R, Fraser CM: Genome analysis of multiple pathogenic
isolates of Streptococcus agalactiae: implications for the microbial
"pan-genome". Proc Natl Acad Sci USA 2005, 102(39):13950-13955.

66. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial
pan-genome. Curr Opin Genet Dev 2005, 15(6):589-594.

67. Tettelin H, Riley D, Cattuto C, Medini D: Comparative genomics: the
bacterial pan-genome. Curr Opin Microbiol 2008, 11(5):472-477.

68. Sims GE, Kim SH: Whole-genome phylogeny of Escherichia
coli/Shigella group by feature frequency profiles (FFPs). Proc Natl
Acad Sci USA 2011, 108(20):8329-8334.

69. Boerlin P, McEwen SA, Boerlin-Petzold F, Wilson JB, Johnson RP, Gyles CL:
Associations between virulence factors of Shiga toxin-producing
Escherichia coli and disease in humans. J Clin Microbiol 1999,
37(3):497-503.

70. McDaniel TK, Jarvis KG, Donnenberg MS, Kaper JB: A genetic locus of
enterocyte effacement conserved among diverse enterobacterial
pathogens. Proc Natl Acad Sci USA 1995, 92(5):1664-1668.

71. Malakooti J, Ely B, Matsumura P: Molecular characterization,
nucleotide sequence, and expression of the fliO, fliP, fliQ, and fliR
genes of Escherichia coli. J Bacteriol 1994, 176:189-197.

72. Wiame E, Delpierre G, Collard F, Van Schaftingen E: Identification of a
pathway for the utilization of the Amadori product fructoselysine
in Escherichia coli. J Biol Chem 2002, 277(45):42523-42529.

73. Weickert MJ, Adhya S: The galactose regulon of Escherichia coli.
Mol Microbiol 1993, 10(2):245-251.

74. Diaz E, Ferrandez A: Biodegradation of aromatic compounds by 
Escherichia coli. Microbiol Mol Biol Rev 2001, 65(4):523-569. 

75. Kennemann L, Didelot X, Aebischer T, Kuhn S, Drescher B, Droege M,
Reinhardt R, Correa P, Meyer TF, Josenhans C, Falush D, Suerbaum S:
Helicobacter pylori genome evolution during human infection.
Proc Natl Acad Sci USA 2011, 108(12):5033-5038.

76. Joseph SJ, Didelot X, Gandhi K, Dean D, Read TD: Interplay of
recombination and selection in the genomes of Chlamydia
trachomatis. Biol Direct 2011, 6:28-28.

77. Milkman R, Bridges MM: Molecular evolution of the Escherichia coli
Chromosome. III. Clonal Frames. Genetics 1990, 126:505-517.

78. Milkman R, Jaeger E: Molecular evolution of the Escherichia coli
chromosome. VI. Two regions of high effective recombination.
Genetics 2003, 163(2):475-483.

79. Keck W, van Leeuwen AM: Cloning and characterization of mepA, the
structural gene of the penicillin-insensitive murein endopeptidase
from Escherichia coli. Mol Microbiol 1990, 4(2):209-219.

80. Iida K: Mutants of Escherichia coli defective in penicillin-insensitive
murein DD-endopeptidase. Mol Gen Genet 1983, 189(2):215-221.

81. Dougan G, Chatfield S, Pickard D, Bester J: Construction and
characterization of vaccine strains of Salmonella harboring
mutations in two different aro genes. J Infect Dis 1988,
158(6):1329-1335.

82. Foulongne V, Walravens K, Bourg G, Boschiroli ML, Godfroid J: Aromatic
compound-dependent Brucella suis is attenuated in both cultured
cells and mouse models. Infect Immun 2001, 69:547-550.

83. Roberts CW, Leroux MM, Fleming MD, Orkin SH: Highly penetrant,
rapid tumorigenesis through conditional inversion of the tumor
suppressor gene Snf5. Cancer Cell 2002, 2(5):415-425.






84. Tacket CO, Levine MM: CVD 908, CVD 908-htrA, and CVD 909 live
oral typhoid vaccines: a logical progression. Clin Infect Dis 2007,
45(Suppl 1):20-23.

85. Chatfield SN, Fairweather N, Charles I, Pickard D, Levine M, Hone D, 
Posada M: Construction of a genetically defined Salmonella typhi 
Ty2 aroA, aroC mutant for the engineering of a candidate oral 
typhoid-tetanus vaccine. Vaccine 1992, 10:53-60. 

86. Tacket CO, Sztein MB, Losonsky GA, Wasserman SS, Nataro JP, Edelman
R, Pickard D, Dougan G: Safety of live oral Salmonella typhi vaccine
strains with deletions in htrA and aroC aroD and immune response
in humans. Infect Immun 1997, 65(2):452-456.

87. Khan LA, Khan SA, Al-Hateeti HS, Bhat AR, Bhat KS, Sheikh FS: Clinical 
profile and outcome of poisoning in Najran. Ann Saudi Med 2003, 
23(3-4):205-207. 

88. Hindle Z, Chatfield SN, Phillimore J, Bentley M, Johnson J, Cosgrove CA, 
Ghaem-Maghami M, Sexton A, Khan M, Brennan FR, Everest P, Wu T, 
Pickard D, Holden DW, Dougan G, Griffin GE, House D, Santangelo JD, 
Khan SA, Shea JE: Characterization of Salmonella enterica 
derivatives harboring defined aroC and Salmonella pathogenicity 
island 2 type III secretion system (ssaV) mutations by immunization 
of healthy volunteers. Infect Immun 2002, 70(7):3457-3467. 

89. Daley A, Randall R, Darsley M, Choudhry N, Thomas N, Sanderson IR: 
Genetically modified enterotoxigenic Escherichia coli vaccines 
induce mucosal immune responses without inflammation. Gut 
2007, 56(11):1550-1556. 

90. Srilunchang T, Proungvitaya T, Wongratanacheewin S: Construction 
and characterization of an unmarked aroC deletion mutant of 
Burkholderia pseudomallei strain A2. Southeast Asian J Trop Med 
Public Health 2009, 40:123-130. 

91. Xiao J, Chen T, Wang Q, Liu Q, Wang X, Lv Y: Search for live attenuated
vaccine candidate against edwardsiellosis by mutating
virulence-related genes of fish pathogen Edwardsiella tarda. Lett
Appl Microbiol 2011, 53(4):430-437.

92. Leopold S, Sawyer S: Obscured Phylogeny and Recombinational
Dormancy in Escherichia coli. BMC Evolutionary Biol 2011, 11:183.

93. Daegelen P, Studier FW, Lenski RE, Cure S, Kim JF: Tracing ancestors
and relatives of Escherichia coli B, and the derivation of B strains
REL606 and BL21(DE3). J Mol Biol 2009, 394(4):634-643.

94. Fraser C, Hanage W, Spratt B: Recombination and the nature of
bacterial speciation. Science 2007, 315(5811):476-480.

95. Falush D, Torpdahl M, Didelot X, Conrad DF: Mismatch induced
speciation in Salmonella: model and data. Phil Trans R Soc B 2006,
361:2045-53.

96. Didelot X, Bowden R, Street T, Golubchik T, Spencer C, McVean G, Sangal
V, Anjum ME, Achtman M, Falush D, Donnelly P: Recombination and
population structure in Salmonella enterica. PLoS Genet 2011,
7(7):e1002191.

97. Gordon DM: The genetic structure of Escherichia coli populations in 
primary and secondary habitats. Microbiology 2002, 148(Pt 
5):1513-1522. 

98. Gordon DM, Cowling A: The distribution and genetic structure of
Escherichia coli in Australian vertebrates: host and geographic
effects. Microbiology 2003, 149(Pt 12):3575-3586.

99. White AP, Arnold PM, Norvell DC, Ecker E, Fehlings MG: Pharmacologic
management of chronic low back pain: synthesis of the evidence.
Spine (Phila Pa 1976) 2011, 36(Suppl 21):131-143.

100. Nowrouzian EL: Escherichia coli strains belonging to phylogenetic
group B2 have superior capacity to persist in the intestinal
microflora of infants. J Infect Dis 2005, 191(7):1078-1083.

101. Vulic M, Dionisio F, Taddei F, Radman M: Molecular keys to speciation:
DNA polymorphism and the control of genetic exchange in
enterobacteria. Proc Natl Acad Sci USA 1997, 94(18):9763-9767.

102. Majewski J: Sexual isolation in bacteria. FEMS Microbiol Lett 2001,
199(2):161-169.

103. Lawrence JG: Gene transfer in bacteria: speciation without species?
Theor Popul Biol 2002, 61(4):449-460.

104. Retchless AC, Lawrence JG: Temporal fragmentation of speciation in
bacteria. Science 2007, 317(5841):1093-1096.

105. Lawrence JG, Retchless AC: The interplay of homologous
recombination and horizontal gene transfer in bacterial
speciation. Methods Mol Biol 2009, 532:29-53.

106. Retchless AC, Lawrence JG: Phylogenetic incongruence arising from
fragmented speciation in enteric bacteria. Proc Natl Acad Sci USA
2010, 107(25):11453-11458.



doi:10.1186/1471-2164-13-256

Cite this article as: Didelot et al.: Impact of homologous and
non-homologous recombination in the genomic evolution of Escherichia coli.
BMC Genomics 2012 13:256.






